Finding slowdowns and solutions.

“Local Data”

You have data that…

  • You want to analyze: summarize, visualize, model, etc.

  • Can be downloaded somewhere on your local machine.

  • Can be read fully into R…

… but maybe very slowly…

… and maybe only if it’s a parquet file.

What if it’s bigger than that?

  • It’s probably in a cloud database (e.g., AWS).

  • Do the individual files/tables fit on disk?

  • Can you query subsets of the data and fit those on disk?

Vocab

Key Terms

  • Data is on disk if it is stored on your computer’s drive (HDD/SSD).

  • Data is in-memory if it is loaded into RAM, e.g. by reading it into R.

  • A csv file stores tabular data as plain, comma-separated text, read row by row.

  • A parquet file stores data in a compressed, column-oriented binary format, so selected columns can be read without scanning entire rows.
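To make the csv/parquet difference concrete, here is a small sketch (assumes the arrow package is installed; file names are arbitrary):

```r
library(arrow)

# write the same built-in data frame both ways
write.csv(mtcars, "mtcars.csv", row.names = FALSE)
write_parquet(mtcars, "mtcars.parquet")

# because parquet is column-oriented, you can read back only the
# columns you need, without scanning the rest of the file
just_two <- read_parquet("mtcars.parquet", col_select = c("mpg", "cyl"))
```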

Key Packages

  • data.table optimizes calculations in R on data frames, via algorithmic cleverness and C implementation.

  • duckdb creates a SQL database locally and lets you use R code to execute SQL operations.

  • arrow provides ways to read and write parquet files and to move data around between data.table, duckdb, and other formats.

Helper Packages

  • Friends of data.table:
    • dtplyr, tidyfast for dplyr syntax
    • mlr3 for machine learning
  • Friends of duckdb:
    • duckplyr for dplyr syntax
    • odbc, for connecting to cloud databases
  • Other speed/efficiency helpers:
    • polars, in Python
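A minimal sketch of the duckdb workflow described above (the database file, table file, and column names here are placeholders):

```r
library(duckdb)  # also attaches DBI

con <- dbConnect(duckdb(), dbdir = "mydata.duckdb")

# duckdb can query a csv file directly, so only the matching rows
# ever reach R's memory
old_folks <- dbGetQuery(
  con,
  "SELECT * FROM 'psam_p04.csv' WHERE AGEP > 65"
)

dbDisconnect(con, shutdown = TRUE)
```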

Possible Slowdowns

1. My data reads in slowly.

A. Use data.table::fread() instead.

B. Write it to a parquet version; use arrow::read_parquet().

C. Put it in a duckdb; use queries to avoid reading the whole dataset at once.
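The three remedies might look like this in practice (paths are placeholders; assumes data.table, arrow, and duckdb are installed):

```r
# A. data.table's fast multi-threaded csv reader
dat <- data.table::fread("big_file.csv")

# B. convert to parquet once; all later reads are fast
arrow::write_parquet(dat, "big_file.parquet")
dat <- arrow::read_parquet("big_file.parquet")

# C. query a subset straight from the file via duckdb
con <- DBI::dbConnect(duckdb::duckdb())
sub <- DBI::dbGetQuery(con, "SELECT * FROM 'big_file.csv' WHERE ST = '04'")
DBI::dbDisconnect(con, shutdown = TRUE)
```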

2. One of my pipelines is a little slow, and I run it many times.

A. Are you using vectorized functions (or could you)?

B. Use data.table - do the small speed gains add up?

C. Can you move some subsetting steps to duckdb?
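For option A, the classic pattern is replacing an element-by-element loop with a single vectorized call (toy example):

```r
x <- runif(1e6)

# loop version: one R-level function call per element
out <- numeric(length(x))
for (i in seq_along(x)) out[i] <- sqrt(x[i])

# vectorized version: one call; the looping happens in compiled code
out <- sqrt(x)
```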

3. One of my pipelines is very slow.

A. Can you re-order the pipeline?

B. Are you doing a split-apply-combine over many groups? data.table!

C. Are you doing a subsetting process? data.table or duckdb!

D. Is it just a lot of data? duckdb and calculate in partitions.
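For B, a self-contained sketch of split-apply-combine in data.table syntax (toy data):

```r
library(data.table)

# one value per row, spread over 10,000 groups
DT <- data.table(g = sample(1e4, 1e6, replace = TRUE), x = runif(1e6))

# grouped max in one call; data.table's optimized grouping
# machinery handles the many-groups case efficiently
DT[, .(max_x = max(x)), by = g]
```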

Setup

library(tidyverse)

dat <- read_csv("../data/raw_csvs/person/2021/az/psam_p04.csv")

Step 1. Timing Chunks

The tictoc package

library(tictoc)

tic()

  dat |>
    pivot_longer(PWGTP1:PWGTP80,
               names_to = "Weight_Num",
               values_to = "Weight_Amount") |>
    group_by(ST) |>
    mutate(
      max_weight = max(Weight_Amount)
    )
# A tibble: 5,842,000 × 210
# Groups:   ST [1]
   RT    SERIALNO   DIVISION SPORDER PUMA  REGION ST    ADJINC PWGTP  AGEP   CIT
   <chr> <chr>         <dbl> <chr>   <chr>  <dbl> <chr>  <dbl> <dbl> <dbl> <dbl>
 1 P     2021GQ000…        8 01      00800      4 04    1.03e6    35    36     1
 2 P     2021GQ000…        8 01      00800      4 04    1.03e6    35    36     1
 3 P     2021GQ000…        8 01      00800      4 04    1.03e6    35    36     1
 4 P     2021GQ000…        8 01      00800      4 04    1.03e6    35    36     1
 5 P     2021GQ000…        8 01      00800      4 04    1.03e6    35    36     1
 6 P     2021GQ000…        8 01      00800      4 04    1.03e6    35    36     1
 7 P     2021GQ000…        8 01      00800      4 04    1.03e6    35    36     1
 8 P     2021GQ000…        8 01      00800      4 04    1.03e6    35    36     1
 9 P     2021GQ000…        8 01      00800      4 04    1.03e6    35    36     1
10 P     2021GQ000…        8 01      00800      4 04    1.03e6    35    36     1
# ℹ 5,841,990 more rows
# ℹ 199 more variables: CITWP <dbl>, COW <dbl>, DDRS <dbl>, DEAR <dbl>,
#   DEYE <dbl>, DOUT <dbl>, DPHY <dbl>, DRAT <dbl>, DRATX <dbl>, DREM <dbl>,
#   ENG <dbl>, FER <dbl>, GCL <dbl>, GCM <dbl>, GCR <dbl>, HIMRKS <dbl>,
#   HINS1 <dbl>, HINS2 <dbl>, HINS3 <dbl>, HINS4 <dbl>, HINS5 <dbl>,
#   HINS6 <dbl>, HINS7 <dbl>, INTP <dbl>, JWMNP <dbl>, JWRIP <dbl>,
#   JWTRNS <chr>, LANX <dbl>, MAR <dbl>, MARHD <dbl>, MARHM <dbl>, …
toc()
3.124 sec elapsed
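Base R can do the same without a package: system.time() times an expression, and proc.time() gives a running tally you can difference yourself.

```r
# time a whole expression
system.time({
  Sys.sleep(0.5)
})

# or take before/after snapshots
t0 <- proc.time()
Sys.sleep(0.5)
proc.time() - t0  # user/system/elapsed seconds since t0
```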

Step 2. Profiling Processes

profvis::profvis({
  dat |>
    pivot_longer(PWGTP1:PWGTP80,
               names_to = "Weight_Num",
               values_to = "Weight_Amount") |>
    group_by(ST) |>
    mutate(
      max_weight = max(Weight_Amount)
    )
})

Step 3. Benchmarking Solutions

Function wrappers (optional)

old_pipeline <- function() {
  dat |>
    pivot_longer(PWGTP1:PWGTP80,
               names_to = "Weight_Num",
               values_to = "Weight_Amount") |>
    group_by(CIT) |>
    mutate(
      max_weight = max(Weight_Amount)
    )
}

library(dtplyr)

new_pipeline <- function() {
  dat |>
    rowwise() |>
    summarize(
      # c_across() collects the 80 weight columns for each row; a bare
      # PWGTP1:PWGTP80 here would build a sequence of numbers instead
      max_weight = max(c_across(PWGTP1:PWGTP80)),
      CIT = CIT
    ) |>
    group_by(CIT) |>
    summarize(
      max_weight = max(max_weight)
    )
}
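Since dtplyr is loaded above, another candidate is to wrap the data in lazy_dt(), which translates the same dplyr/tidyr verbs into data.table code (a sketch, not benchmarked here):

```r
dtplyr_pipeline <- function() {
  dat |>
    lazy_dt() |>
    pivot_longer(PWGTP1:PWGTP80,
                 names_to = "Weight_Num",
                 values_to = "Weight_Amount") |>
    group_by(CIT) |>
    mutate(max_weight = max(Weight_Amount)) |>
    as_tibble()  # forces the translated data.table code to actually run
}
```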

The microbenchmark package for multiple runs

microbenchmark::microbenchmark(
  old_version = old_pipeline(),
  new_version = new_pipeline(),
  times = 5
)
Unit: milliseconds
        expr       min        lq      mean    median        uq       max neval
 old_version 1612.0746 1673.3494 1820.7446 1823.9169 1984.6106 2009.7713     5
 new_version  191.5454  250.3799  329.5858  387.5821  408.1263  410.2953     5

The bench package for memory comparisons…

bench::mark(
  old_version = old_pipeline(),
  new_version = new_pipeline(),
  check = FALSE,
  max_iterations = 3
)
# A tibble: 2 × 6
  expression       min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 old_version    2.09s    2.09s     0.478    9.45GB    0.956
2 new_version 191.34ms  199.4ms     4.51    15.24MB   28.6  

… and testing scaling with size

results <- bench::press(
  duplications = c(1, 2),
  {
    dat_big <- bind_rows(replicate(duplications, dat, simplify = FALSE))
    # NB: the pipeline functions read the global `dat`, not `dat_big`,
    # so as written every run benchmarks the original data; that is why
    # mem_alloc stays flat across duplications below. To test scaling
    # for real, have the pipelines take the data as an argument and
    # pass dat_big here.
    bench::mark(
      old_version = old_pipeline(),
      new_version = new_pipeline(),
      check = FALSE,
      max_iterations = 3
    )
  }
)

results
# A tibble: 4 × 7
  expression  duplications      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>         <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 old_version            1    2.04s    2.04s     0.490    9.45GB    0.490
2 new_version            1 194.53ms  251.4ms     3.47    15.24MB   20.8  
3 old_version            2    1.96s    1.96s     0.509    9.45GB    1.02 
4 new_version            2 248.67ms 262.87ms     3.80    15.24MB   20.9  

Summary

  1. Use the tictoc package or proc.time() to get a feel for runtimes of larger chunks.

  2. Use profiling to narrow down where in a chunk the slowdowns are.

  3. Use benchmarking to compare your old version to a proposed solution.

  4. Use bench::press() to see how your speed-ups scale with data size.